Abstract. We study the online clustering problem where data items arrive in an online 
fashion. The algorithm maintains a clustering of data items into similarity classes. Upon 
arrival of v, the relation between v and previously arrived items is revealed, so that for 
each u we are told whether v is similar to u. The algorithm can create a new cluster for v 
and merge existing clusters. 

When the objective is to minimize disagreements between the clustering and the input, 
we prove that a natural greedy algorithm is $O(n)$-competitive, and this is optimal. 

When the objective is to maximize agreements between the clustering and the input, we 
prove that the greedy algorithm is 0.5-competitive and that no online algorithm can be better 
than 0.834-competitive. We also prove that it is possible to do better than 1/2, by exhibiting a 
randomized algorithm with competitive ratio 0.5 + c for a small positive fixed constant c. 



1. Introduction 

We study online correlation clustering. In correlation clustering [2, 15], the input is 
a complete graph whose edges are labeled either positive, meaning similar, or negative, 
meaning dissimilar. The goal is to produce a clustering that agrees as much as possible 
with the edge labels. More precisely, the output is a clustering that maximizes the number 
of agreements, i.e., the sum of positive edges within clusters and the negative edges between 
clusters. Equivalently, this clustering minimizes the disagreements. This has applications 
in information retrieval, e.g. [8, 10]. 

In the online setting, vertices arrive one at a time and the total number of vertices is 
unknown to the algorithm a priori. Upon the arrival of a vertex, the labels of the edges that 
connect this new vertex to the previously discovered vertices are revealed. The algorithm 
updates the clustering while preserving the clusters already identified (it is not permitted to 
split any pre-existing cluster). Motivated by information retrieval applications, this online 
model was proposed by Charikar, Chekuri, Feder and Motwani [5] (for another clustering 
problem). As in [5], our algorithms maintain Hierarchical Agglomerative Clusterings at all 
times; this is well suited for the applications of interest. 

The problem of correlation clustering was introduced by Ben-Dor et al. [3] to cluster 
gene expression patterns. Unfortunately, it was shown that even the offline version of 
correlation clustering is NP-hard [15, 2]. The following are the two approximation problems 
that have been studied [2, 3, 1]: Given a complete graph whose edges are labeled positive 
or negative, find a clustering that minimizes the number of disagreements, or maximizes 
the number of agreements. We will call these problems MinDisAgree and MaxAgree 
respectively. Bansal et al. [2] studied approximation algorithms both for minimization 
and maximization problems, giving a constant factor algorithm for MinDisAgree, and 
a Polynomial Time Approximation Scheme (PTAS) for MaxAgree. Charikar et al. [7] 
proved that MinDisAgree is APX-hard and gave a factor 4 approximation. Ailon et al. 
[1] presented a randomized factor 2.5 approximation for MinDisAgree, which is currently 
the best known factor. The problem has attracted significant attention, with further work 
on several variants P El EH [El El [l2l E] . 

In this paper, we study online algorithms for MinDisAgree and MaxAgree. We 
prove that MinDisAgree is essentially hopeless in the online setting: the natural greedy 
algorithm is $O(n)$-competitive, and this is optimal up to a constant factor, even with 
randomization (Theorem 3.4). The situation is better for MaxAgree: we prove that the 
greedy algorithm is 0.5-competitive (Theorem 2.1), but that no algorithm can be better 
than 0.803-competitive (0.834 for randomized algorithms, see Theorem 2.2). What is the 
optimal competitive ratio? We prove that it is better than 0.5 by exhibiting an algorithm 
with competitive ratio $0.5 + \varepsilon_0$, where $\varepsilon_0$ is a small absolute constant (Corollary 2.6). Thus 
Greedy is not always the best choice! 

More formally, let $v_1, \ldots, v_n$ denote the sequence of vertices of the input graph, where 
$n$ is not known in advance. Between any two vertices $v_i$ and $v_j$ with $i \ne j$, there is an 
edge labeled positive or negative. In MinDisAgree (resp. MaxAgree), the goal is to 
find a clustering $\mathcal{C}$, i.e. a partition of the nodes, that minimizes the number of 
disagreements cost($\mathcal{C}$): the number of negative edges within clusters plus the number of positive 
edges between clusters (resp. maximizes the number of agreements profit($\mathcal{C}$): the number 
of positive edges within clusters plus the number of negative edges between clusters). 
Although these problems are equivalent in terms of optimality, they differ from the point of 
view of approximation. Let OPT denote the optimum solution of MinDisAgree and of 
MaxAgree. 
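The two objectives can be made concrete with a short sketch. The function name `cost_and_profit` and the input encoding (vertices numbered from 0, positive edges stored as a set of `frozenset` pairs, every other pair negative) are assumptions made for illustration only; they do not come from the paper.

```python
from itertools import combinations


def cost_and_profit(n, positive, clusters):
    """Return (cost, profit) of a clustering of the vertices 0..n-1.

    positive: set of frozenset({u, v}) for the edges labeled positive;
              every other pair is treated as a negative edge.
    clusters: list of disjoint sets that together cover range(n).
    """
    # Map every vertex to the index of the cluster that contains it.
    label = {v: idx for idx, c in enumerate(clusters) for v in c}
    cost = profit = 0
    for u, v in combinations(range(n), 2):
        same_cluster = label[u] == label[v]
        is_positive = frozenset((u, v)) in positive
        if same_cluster == is_positive:
            profit += 1  # agreement: positive inside or negative across
        else:
            cost += 1    # disagreement: negative inside or positive across
    return cost, profit


# A path 0 - 1 - 2 of positive edges; the pair (0, 2) is negative.
positive = {frozenset((0, 1)), frozenset((1, 2))}
print(cost_and_profit(3, positive, [{0, 1, 2}]))
print(cost_and_profit(3, positive, [{0, 1}, {2}]))
```

Since every pair of vertices is either an agreement or a disagreement, cost and profit always sum to the number of edges, which is why the two problems share the same optimal clusterings.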

In the online setting, upon the arrival of a new vertex, the algorithm updates the 
current clustering: it may either create a new singleton cluster or add the new vertex to a 
pre-existing cluster, and may decide to merge some pre-existing clusters. It is not allowed 
to split pre-existing clusters. 

A $c$-competitive algorithm for MinDisAgree outputs, on any input $\sigma$, a clustering $\mathcal{C}(\sigma)$ 
such that cost($\mathcal{C}(\sigma)$) $\le c \cdot$ cost(OPT($\sigma$)). For MaxAgree, we must have profit($\mathcal{C}(\sigma)$) $\ge 
c \cdot$ profit(OPT($\sigma$)). (When the algorithm is randomized, this must hold in expectation.) 

2. Maximizing Agreements Online 

2.1. Competitiveness of Greedy 

For subsets of vertices $S$ and $T$ we define $\Gamma(S, T)$ as the set of edges between $S$ and $T$. 
We write $\Gamma^+(S, T)$ (resp. $\Gamma^-(S, T)$) for the set of positive (resp. negative) edges of $\Gamma(S, T)$. 
We define the gain of merging $S$ with $T$ as the change in the profit when clusters $S$ and $T$ 
are merged: 

\[ \mathrm{gain}(S, T) = |\Gamma^+(S, T)| - |\Gamma^-(S, T)| = 2|\Gamma^+(S, T)| - |S||T|. \]

We present the following greedy algorithm for online correlation clustering. 

Algorithm 1 Algorithm Greedy 

  Upon the arrival of vertex v do 
    Put v in a new cluster consisting of {v}. 
    while there are two clusters C, C' such that gain(C, C') > 0 do 
      Merge C and C' 
    end while 
  end for 
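The greedy rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `gain` and `greedy_online`, the edge encoding (a set of `frozenset` pairs for the positive edges, all other pairs negative), and the rule of merging the first positive-gain pair found are all assumptions. The paper does not specify which pair to merge when several qualify.

```python
from itertools import combinations


def gain(S, T, positive):
    """gain(S, T) = 2 * |positive edges between S and T| - |S| * |T|."""
    pos = sum(1 for u in S for v in T if frozenset((u, v)) in positive)
    return 2 * pos - len(S) * len(T)


def greedy_online(order, positive):
    """Process vertices in arrival order; never split an existing cluster."""
    clusters = []
    for v in order:
        clusters.append({v})  # the new vertex starts as a singleton
        merged = True
        while merged:         # repeat until no pair has positive gain
            merged = False
            for a, b in combinations(range(len(clusters)), 2):
                if gain(clusters[a], clusters[b], positive) > 0:
                    clusters[a] |= clusters[b]
                    del clusters[b]
                    merged = True
                    break     # cluster indices changed, rescan
    return clusters


# Path 0 - 1 - 2 of positive edges, (0, 2) negative.
positive = {frozenset((0, 1)), frozenset((1, 2))}
print(greedy_online([0, 1, 2], positive))
```

On this input the gain of merging {0, 1} with {2} is exactly 0, so the strict inequality in the while-condition leaves vertex 2 in its own cluster.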



Theorem 2.1. Let OPT denote the offline optimum. 

• For every instance, profit(Greedy) $\ge$ 0.5 profit(OPT). 

• There are instances with profit(Greedy) $\le$ (0.5 + o(1)) profit(OPT). 



2.2. Bounding the optimal competitive ratio 

Theorem 2.2. The competitive ratio of any randomized online algorithm for MaxAgree is 
at most 0.834. The competitive ratio of any deterministic online algorithm for MaxAgree 
is at most 0.803. 

The proof uses Yao's Min-Max Theorem [4] (maximization version). 

Theorem 2.3 (Yao's Min-Max Theorem). Fix a distribution $D$ over a set of inputs $(I_\sigma)_\sigma$. 
The competitive ratio of any randomized online algorithm is at most 

\[ \max\left\{ \frac{E_I[\mathrm{profit}(A(I))]}{E_I[\mathrm{profit}(\mathrm{OPT}(I))]} : A \text{ deterministic online algorithm} \right\}, \]

where the expectations are over a random input $I$ drawn from distribution $D$. 

To prove Theorem 2.2, we first define two generic inputs that we will use to apply 
Theorem 2.3. The first input is a graph $G_1$ with $2m$ vertices and all positive edges between 
them. The second input is a graph with $6m$ vertices defined as follows. The first $2m$ vertices 
have all positive edges between them, the next $2m$ vertices have all positive edges between 
them, and the last $2m$ vertices also have all positive edges between them. In each of these 
three sets $G_1, G_2, G_3$ of $2m$ vertices, half are labelled "left side" vertices and the other half 
are labelled "right side" vertices. All edges between left vertices are positive, but edges 
between a vertex $u$ on the left side of $G_i$ and a vertex $v$ on the right side of $G_j$, $j \ne i$, are 
all negative. 

The online algorithm cannot distinguish between the two inputs until time 2m + 1, so 
it must hedge against two very different possible optimal structures. 

2.3. Beating Greedy 

2.3.1. Designing the algorithm. Our algorithm is based on the observation that Algorithm 
Greedy always satisfies at least half of the edges. Thus, if profit(OPT) is less than $(1 - 
\alpha/2)|E|$ for some constant $\alpha$, then the profit of Greedy is better than half of optimal. 
We design an algorithm called Dense, parameterized by constants $\alpha$ and $\tau$, such that if 
profit(OPT) is greater than $(1 - \alpha/2)|E|$, then the approximation factor is at least $0.5 + \eta$ for 
some positive constant $\eta$. We use both algorithms Greedy and Dense to define Algorithm 2. 

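Theorem 2.4. Let $\alpha \in (0, 1)$, $\tau > 1$ and $\eta \in (0, 1/2)$ be such that 

\[ \eta \le 1.5 - \tau^2 - \left( (2\sqrt{3} + 9/2)\alpha^{1/4} + \frac{\alpha^{1/4}}{1 - \alpha^{1/4}} + \alpha/2 \right) \frac{2(2\tau - 1)}{\tau - 1}. \qquad (2.1) \]

Then, for every instance such that OPT $\ge (1 - \alpha/2)|E|$, Algorithm Dense$_{\alpha,\tau}$ has profit at 
least $(1/2 + \eta)$ OPT. 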

Using Theorem 2.4 we can bound the competitive ratio of Algorithm 2. 

Corollary 2.5. Let a,T and r] be as above, and letp = a/{2 + 27/(2 — a)). Then Algorithm 
has competitive ratio at least ^ + j2 — 



+2»)(l-a/2) ■ 

Corollary 2.6. For $\alpha = 10^{-12}$, $\tau = 1.0946$, $\eta = 0.0555$ and $p = 4.5 \cdot 10^{-13}$, Algorithm 2 is 
$(\frac{1}{2} + 2 \cdot 10^{-14})$-competitive. 



Algorithm 2 A $(\frac{1}{2} + \varepsilon_0)$-competitive algorithm 

  Given $p$, $\alpha$, $\tau$: 
  With probability $1 - p$, run Greedy. 
  With probability $p$, run Dense$_{\alpha,\tau}$. 
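A short case analysis suggests why a small mixing probability $p$ suffices. The following is only a sketch built from facts stated in this section (Greedy satisfies at least half of the edges and is 0.5-competitive; Dense achieves ratio $1/2 + \eta$ whenever profit(OPT) $\ge (1 - \alpha/2)|E|$), together with the trivial assumption that the profit of Dense is at least 0 otherwise. It is not quoted from the authors' proof.

```latex
% Case 1: profit(OPT) < (1 - \alpha/2)|E|. Only Greedy is credited.
\frac{E[\mathrm{profit}(\mathrm{ALG})]}{\mathrm{profit}(\mathrm{OPT})}
  \ge \frac{(1-p)\,|E|/2}{(1-\alpha/2)\,|E|} = \frac{1-p}{2-\alpha}.

% Case 2: profit(OPT) >= (1 - \alpha/2)|E|. Both algorithms contribute.
\frac{E[\mathrm{profit}(\mathrm{ALG})]}{\mathrm{profit}(\mathrm{OPT})}
  \ge (1-p)\cdot\frac{1}{2} + p\left(\frac{1}{2}+\eta\right) = \frac{1}{2} + p\,\eta.

% Balancing the two lower bounds:
\frac{1-p}{2-\alpha} = \frac{1}{2} + p\,\eta
  \iff \frac{\alpha}{2} = p\,\bigl(1 + \eta(2-\alpha)\bigr)
  \iff p = \frac{\alpha}{2 + 2\eta(2-\alpha)}.
```

With this choice both cases give a ratio of $1/2 + p\eta$, which exceeds $1/2$ by an amount proportional to $\alpha\eta$.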



Algorithm 3 Algorithm Dense$_{\alpha,\tau}$ 

  Let $\mathcal{C} = \mathrm{OPT}_1$ and for every cluster $D \in \mathcal{C}$, let $\mathrm{repr}_1(D) := D \in \mathrm{OPT}_1$. 
  Upon the arrival of a vertex $v$ at time $t$ do 
    Put $v$ in a new cluster $\{v\}$. 
    if $t = t_i$ for some $i$ then 
      for every cluster $D$ in $\mathrm{OPT}_i$ do 
        Define a cluster $D'$ obtained by merging the restriction of $D$ to $\{t_{i-1}+1, \ldots, t_i\}$ 
        with every cluster $C \in \mathcal{C}$ in $\{1, \ldots, t_{i-1}\}$ such that $\mathrm{repr}_{i-1}(C)$ is defined and is 
        half-contained in $D$. 
        If $D'$ is not empty, set $\mathrm{repr}_i(D') := D \in \mathrm{OPT}_i$. 
      end for 
    end if 
  end for 



How do we define algorithm Dense? Using the PTAS of [2], one can compute offline a 
factor $(1 - \alpha/2)$ approximate solution OPT' of any instance of MaxAgree in polynomial 
time. We will design algorithm Dense so that it guarantees an approximation factor of 
$0.5 + \eta$ whenever profit(OPT') $\ge (1 - \alpha)|E|$. Since profit(OPT) $\ge (1 - \alpha/2)|E|$ implies that 
profit(OPT') $\ge (1 - \alpha/2)^2 |E| \ge (1 - \alpha)|E|$, Theorem 2.4 will follow. 

We say that $\mathrm{OPT}'_t$ is large if profit($\mathrm{OPT}'_t$) $\ge (1 - \alpha)|E|$. We define a sequence $(t_i)_i$ 
of update times inductively as follows: By convention $t_0 = 0$. Time $t_1$ is the earliest time 
$t \ge 100$ such that $\mathrm{OPT}'_t$ is large. Assume $t_i$ is already defined, and let $j$ be such that 
$\tau^{j-1} \le t_i < \tau^j$. If $\mathrm{OPT}'_{\tau^j}$ is large, then $t_{i+1} = \tau^j$, else $t_{i+1}$ is the earliest time $t > \tau^j$ such 
that $\mathrm{OPT}'_t$ is large. Let $t_1, t_2, \ldots, t_K$ be the resulting sequence. We will write, with an abuse 
of notation, $\mathrm{OPT}'_i$ instead of $\mathrm{OPT}'_{t_i}$, for $1 \le i \le K$. 
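The inductive definition of the update times can be sketched as follows. The callback `is_large(t)` is a hypothetical stand-in for "the approximate solution on the first t vertices is large"; the function name `update_times`, the parameter `t_min`, and the rounding of the threshold up to an integer time step are likewise assumptions made for this illustration.

```python
import math


def update_times(n, tau, is_large, t_min=100):
    """Return the update times among 1..n.

    is_large: hypothetical predicate, is_large(t) -> bool.
    t_min:    earliest admissible first update time (100 in the text).
    """
    def earliest_large(start):
        # First time step >= start at which the predicate holds, if any.
        return next((s for s in range(start, n + 1) if is_large(s)), None)

    times = []
    t = earliest_large(t_min)
    while t is not None:
        times.append(t)
        # Find the threshold tau**j with tau**(j-1) <= t < tau**j.
        threshold = 1.0
        while threshold <= t:
            threshold *= tau
        # Assumption: a non-integer threshold is rounded up.
        t = earliest_large(max(math.ceil(threshold), t + 1))
    return times


print(update_times(1000, 2.0, lambda t: True))
```

Each update time lands in a different interval between consecutive powers of $\tau$, so there are only logarithmically many of them; this is the property the analysis uses when it sums the update times.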

We say that a cluster $A$ is half-contained in $B$ if $|A \cap B| > |A|/2$. Let $\varepsilon = \alpha^{1/4}$. For 
each $t_i$, we inductively define a near optimal clustering $\mathrm{OPT}_i$ of the nodes $[1, t_i]$. For the base 
case, let $\mathrm{OPT}_1$ be the clustering obtained from $\mathrm{OPT}'_1$ by keeping the $1/\varepsilon^2$ largest clusters 
and splitting the other clusters into singletons. For the general case, to define $\mathrm{OPT}_i$ given 
$\mathrm{OPT}_{i-1}$, mark the clusters of $\mathrm{OPT}'_i$ as follows. For any $D$ in $\mathrm{OPT}'_i$, mark $D$ if either one 
of the $1/\varepsilon^2 - 1/\varepsilon$ largest clusters of $\mathrm{OPT}_{i-1}$ is half-contained in $D$, or $D$ is one of the $1/\varepsilon$ 
largest clusters of $\mathrm{OPT}'_i$. Then $\mathrm{OPT}_i$ contains all the marked clusters of $\mathrm{OPT}'_i$ and the rest 
of the vertices in $[1, t_i]$ as singleton clusters. (Note that, by definition, any $\mathrm{OPT}_i$ contains 
at most $1/\varepsilon^2$ non-singleton clusters; this will be useful in the analysis.) 
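The marking step can be sketched as below. The names `half_contained` and `next_opt` are hypothetical, clusterings are represented as lists of Python sets, and ties between clusters of equal size are broken by list order; these are assumptions for illustration, since the text does not say how ties are resolved. The cluster counts follow the reading in which a cluster is marked if it half-contains one of the $1/\varepsilon^2 - 1/\varepsilon$ largest previous clusters or is among the $1/\varepsilon$ largest new ones.

```python
def half_contained(A, B):
    """A is half-contained in B if more than half of A lies in B."""
    return len(A & B) > len(A) / 2


def next_opt(prev_opt, opt_prime, eps):
    """Build the next pruned clustering from prev_opt and opt_prime.

    prev_opt:  previous pruned clustering (list of sets).
    opt_prime: current approximate clustering (list of sets).
    """
    k_prev = round(1 / eps**2 - 1 / eps)  # largest previous clusters to track
    k_new = round(1 / eps)                # largest new clusters always kept

    def largest(clusters, k):
        # Stable sort: equal-size clusters keep their list order.
        return sorted(clusters, key=len, reverse=True)[:k]

    big_prev = largest(prev_opt, k_prev)
    top_new = largest(opt_prime, k_new)

    marked = []
    for D in opt_prime:
        keeps_size = any(D is T for T in top_new)
        absorbs_prev = any(half_contained(A, D) for A in big_prev)
        if keeps_size or absorbs_prev:
            marked.append(set(D))

    # Every vertex outside a marked cluster becomes a singleton.
    covered = set().union(*marked)
    everyone = set().union(*opt_prime)
    return marked + [{v} for v in sorted(everyone - covered)]


prev_opt = [{0, 1, 2}, {3, 4}, {5}]
opt_prime = [{0, 1, 2, 6}, {3, 7}, {4, 5}, {8}]
print(next_opt(prev_opt, opt_prime, eps=0.5))
```

In the example, {4, 5} is neither among the two largest new clusters nor does it contain more than half of a tracked previous cluster, so it is split into singletons.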

Note that Dense only depends on parameters $\alpha$ and $\tau$ indirectly via the definition of 
update times and of $\mathrm{OPT}_i$. 

2.3.2. Analysis: Proof of Theorem 2.4. The analysis is by induction on $i$, assuming that we 
start from clustering $\mathrm{OPT}_i$ at time $t_i$, then apply the above algorithm from time $t_i$ to the 
final time $t$. If $i = 1$ this is exactly our algorithm, and if $i = K$ then this is simply $\mathrm{OPT}_K$; 
in general it is a mixture of the two constructions. 

More formally, define a forest $\mathcal{F}$ (at time $t$) with one node for each $t_i \le t$ and cluster of 
$\mathrm{OPT}_i$. The node associated to a cluster $A$ of $\mathrm{OPT}_{i-1}$ is a child of the node associated to a 
cluster $B$ of $\mathrm{OPT}_i$ if and only if $A$ is half-contained in $B$. With a slight abuse of notation, 
we define the following clustering associated to the forest. There is one cluster $T$ for each 
tree of the forest: for each node $A$ of the tree, if $i$ is such that $A \in \mathrm{OPT}_i$, then cluster $T$ 
contains $A \cap (t_{i-1}, t_i]$. This defines $\mathcal{F}$. 

One interpretation of Dense is that at all times $t$, there is an associated forest and 
clustering $\mathcal{F}$, and our algorithm Dense simply maintains it. See Figure 1 for an example. 

Lemma 2.7. Algorithm 3 is an online algorithm that outputs clustering $\mathcal{F}$ at time $t$. 

Let $\mathcal{F}_i$ be the forest obtained from $\mathcal{F}$ by erasing every node associated to clusters of 
$\mathrm{OPT}_j$ for every $j < i$. With a slight abuse of notation, we define the following clustering 
$\mathcal{F}_i$ associated to that forest: there is one cluster $C$ for each tree of the forest defined as 
follows. For each node $A$ of the tree, let $k \ge i$ be such that $A \in \mathrm{OPT}_k$: then $C$ contains 
$A \cap (t_{k-1}, t_k]$ if $k > i$, and $C$ contains $A$ if $k = i$. This defines a sequence of clusterings $(\mathcal{F}_i)$ such 
that $\mathcal{F}_1 = \mathcal{F}$ is the output of the algorithm, and $\mathcal{F}_K = \mathrm{OPT}_K$. 

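Lemma 2.8 (Main lemma). For any $2 \le i \le K$, 

\[ \mathrm{cost}(\mathcal{F}_{i-1}) - \mathrm{cost}(\mathcal{F}_i) \le \left( (4 + 2\sqrt{3})\varepsilon + \frac{\varepsilon}{1 - \varepsilon} \right) t_i t_K. \]
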
We defer the proof of Lemma 2.8 to the next section. Assuming Lemma 2.8, we 
upper-bound the cost of clustering $\mathcal{F}$. 

Lemma 2.9 (Lemma 14, [2]). For any $0 < c < 1$ and clustering $\mathcal{C}$, let $\mathcal{C}'$ be the clustering 
obtained from $\mathcal{C}$ by splitting all clusters of $\mathcal{C}$ of size less than $cn$, where $n$ is the number of 
vertices. Then cost($\mathcal{C}'$) $\le$ cost($\mathcal{C}$) $+ cn^2/2$. 

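Lemma 2.10. cost($\mathcal{F}$) $\le \left( (2\sqrt{3} + 9/2)\varepsilon + \frac{\varepsilon}{1 - \varepsilon} + \varepsilon^4/2 \right) \frac{2\tau - 1}{\tau - 1} t_K^2$. 

Proof. We write: cost($\mathcal{F}$) = cost($\mathrm{OPT}_K$) + $\sum_{i=2}^{K}$ (cost($\mathcal{F}_{i-1}$) $-$ cost($\mathcal{F}_i$)). By definition, 
$\mathrm{OPT}_K$ contains the $1/\varepsilon$ largest clusters of $\mathrm{OPT}'_K$. Then the remaining clusters of $\mathrm{OPT}'_K$ 
are of size at most $\varepsilon t_K$. By Lemma 2.9 the cost of $\mathrm{OPT}_K$ is at most cost($\mathrm{OPT}'_K$) $+ \varepsilon t_K^2/2 \le 
(\alpha + \varepsilon) t_K^2/2$. Applying Lemma 2.8 and summing over $2 \le i \le K$, we get 

\[ \mathrm{cost}(\mathcal{F}) \le (\alpha + \varepsilon) t_K^2/2 + \left( (4 + 2\sqrt{3})\varepsilon + \frac{\varepsilon}{1 - \varepsilon} \right) t_K \sum_{i} t_i. \]

By definition of the update times, for any $j \ge 0$ there exists at most one $t_i$ such that 
$\tau^j \le t_i < \tau^{j+1}$. Let $L$ be such that $\tau^L \le t_K < \tau^{L+1}$. Then 

\[ \sum_{1 \le i \le K} t_i \le \sum_{1 \le i \le K-1} t_i + t_K \le \sum_{1 \le j \le L} \tau^j + t_K \le \frac{\tau^{L+1}}{\tau - 1} + t_K \le \frac{2\tau - 1}{\tau - 1} t_K. \]

Hence the desired bound on cost($\mathcal{F}$). ■ 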

Proof of Theorem 2.4. Fix an input graph of size $n$, such that profit(OPT) $\ge (1 - \alpha/2)\binom{n}{2}$. 
By Lemma 2.10, at time $t_K$, Algorithm 3 has clustering $\mathcal{F}$ with cost($\mathcal{F}$) $\le O(\varepsilon) \frac{2\tau - 1}{\tau - 1} t_K^2$. 

By definition of the update times, $n \le \tau t_K$. To guarantee a competitive ratio of $0.5 + \eta$, 
for some $\eta$, the cost must not exceed $(0.5 - \eta)\binom{n}{2}$ at time $n$, when all vertices $t_K + 1, \ldots, n$ 
are added as singleton clusters. The number of new edges added to the graph between times 
$t_K$ and $n$ is $\binom{n - t_K}{2} + t_K(n - t_K)$. We must have 

\[ \frac{2\tau - 1}{\tau - 1} O(\varepsilon) t_K^2 + \binom{n - t_K}{2} + t_K(n - t_K) \le (0.5 - \eta)\binom{n}{2} \qquad (2.2) \]

for some $0 < \eta < 0.5$. Using the fact that $n - t_K \le (\tau - 1) t_K$ and $t_K \le n - 1$, to satisfy 
(2.2) it suffices to have 

\[ \frac{2\tau - 1}{\tau - 1} O(\varepsilon) t_K^2 + t_K^2 (\tau - 1)^2/2 + (\tau - 1) t_K^2 \le (0.5 - \eta) t_K^2/2, \]

which is equivalent to (2.1). Moreover we have the following natural constraints on constants 
$\eta$, $\varepsilon$ and $\tau$: $0 < \eta < 0.5$, $0 < \varepsilon < 1$, and $\tau > 1$. Then, for any set of values of constants $\eta$, $\varepsilon$, 
$\tau$ verifying those constraints, Algorithm Dense is $(0.5 + \eta)$-competitive. ■ 



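2.3.3. The core of the analysis: proof of Lemma 2.8. 

Lemma 2.11. Let $S^*$ be the set of vertices of the non-singleton clusters that are not among 
the $1/\varepsilon^2 - 1/\varepsilon$ largest clusters of $\mathrm{OPT}_{i-1}$. Then $|S^*| \le \frac{\varepsilon}{1 - \varepsilon} t_{i-1}$. 

Proof. Let $C$ be a cluster of $\mathrm{OPT}_{i-1}$ such that $C \subseteq S^*$. Then $|C| \le (1/\varepsilon^2 - 1/\varepsilon)^{-1} t_{i-1}$. 
Since there are at most $1/\varepsilon$ such clusters, the number of vertices of these is at most 
$\frac{\varepsilon}{1 - \varepsilon} t_{i-1}$. ■ 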

Notation 2.12. For any $i \ne j$, and a cluster $B$ of $\mathrm{OPT}'_i$, we denote by $\gamma_B^{i,j}$ the square root 
of the number of edges of $[1, t_{\min(i,j)}] \times [1, t_{\min(i,j)}]$, adjacent to at least one node of $B$, and 
which are classified differently in $\mathrm{OPT}'_i$ and in $\mathrm{OPT}'_j$. 

We refer to non-singleton clusters as large clusters. 

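Lemma 2.13. Let $T^*$ be the set of vertices of those $1/\varepsilon^2 - 1/\varepsilon$ largest clusters of $\mathrm{OPT}_{i-1}$ 
that are not half-contained in any cluster of $\mathrm{OPT}'_i$. Then $|T^*| \le \sqrt{6} \sum_{\text{large } C \in \mathrm{OPT}_{i-1}} \gamma_C^{i-1,i}$. 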

Let $B$ be a cluster of $\mathrm{OPT}_i$. For any $j \le i$, we define $C_j(B)$ as the cluster associated 
with the tree of $\mathcal{F}_j$ that contains $B$. For any $B \in \mathrm{OPT}_i$, we call $C_{i-1}(B)$ the extension of 
$C_i(B)$ to $\mathcal{F}_{i-1}$. By definition of $\mathcal{F}_j$, the following lemma is easy. 

Lemma 2.14. For any $B \in \mathrm{OPT}_i$, the restriction of $C_{i-1}(B)$ to $(t_{i-1}, t_K]$ is equal to the 
restriction of $C_i(B)$ to $(t_{i-1}, t_K]$. 

Let $(A_j)_j$ denote the clusters of $\mathrm{OPT}_{i-1}$ that are half-contained in $B$. We define $\delta^i(B)$ 
as the symmetric difference of the restriction of $B$ to $[1, t_{i-1}]$ and $\cup_j A_j$: 

\[ \delta^i(B) = (B \cap [1, t_{i-1}]) \,\Delta\, \cup_j A_j. \]

Lemma 2.15. For any cluster $C_i$ of $\mathcal{F}_i$, let $C'_i$ denote the extension of $C_i$ to $\mathcal{F}_{i-1}$. Then 

\[ \bigcup_{C_i \in \mathcal{F}_i} C_i \setminus C'_i \subseteq S^* \cup T^* \cup \bigcup_{\text{large } B \in \mathrm{OPT}_i} \delta^i(B). \]

Proof. By Lemma 2.14, the partition of the vertices $(t_{i-1}, t_K]$ is the same for $C_i$ as for $C'_i$. 
So $C_i$ and $C'_i$ only differ in the vertices of $[1, t_{i-1}]$. 

We will show that for a singleton cluster $B$ of $\mathrm{OPT}_i$, $\delta^i(B)$ is included in $S^* \cup T^* \cup 
\bigcup_{\text{large } B' \in \mathrm{OPT}_i} \delta^i(B')$, which yields the lemma. 

Let $B = \{v\}$ be a singleton cluster of $\mathrm{OPT}_i$ such that $\delta^i(B) \ne \{\}$. A non-singleton 
cluster cannot be half-contained in a singleton cluster, so we conclude no clusters are 
half-contained in $B$ and hence $\delta^i(B) = \{v\}$. By definition of $\delta^i(B)$, $v \in [1, t_{i-1}]$. So there exists 
a cluster $A$ of $\mathrm{OPT}_{i-1}$ that contains $v$. Clearly $A$ is not a singleton since otherwise $\delta^i(B)$ 
would be $\{\}$. There are two cases. 

First, if $A$ is half-contained in a cluster $B' \ne B$ of $\mathrm{OPT}_i$ then cluster $B'$ is necessarily 
large since it contains more than one vertex of $A$. Then we have $v \in \delta^i(B')$. 

Second, if $A$ is not half-contained in any cluster of $\mathrm{OPT}_i$ then $A \subseteq S^* \cup T^*$. In fact, if 
$A$ is half-contained in a cluster of $\mathrm{OPT}'_i$ which is split into singletons in $\mathrm{OPT}_i$, then $A$ is 
not one of the $1/\varepsilon^2 - 1/\varepsilon$ largest clusters of $\mathrm{OPT}_{i-1}$, and $A \subseteq S^*$. If $A$ is not half-contained 
in any cluster of $\mathrm{OPT}'_i$, then $A \subseteq T^*$ if $A$ is one of the $1/\varepsilon^2 - 1/\varepsilon$ largest clusters of $\mathrm{OPT}_{i-1}$ 
and $A \subseteq S^*$ otherwise. ■ 

Lemma 2.16. For any large cluster $B$ of $\mathrm{OPT}_i$, $|\delta^i(B)| \le 2\sqrt{2}\, \gamma_B^{i,i-1}$. 

Proof. Let $B'$ denote the restriction of $B$ to $[1, t_{i-1}]$. We first show that 

\[ \tfrac{1}{2} |\cup_j A_j \setminus B'|^2 \le (\gamma_B^{i,i-1})^2. \]

Observe that $(\gamma_B^{i,i-1})^2$ includes all edges $uv$ such that one of the following two cases occurs. 

First, if $u \in A_j \setminus B$ and $v \in A_j \cap B$: such edges are internal in the clustering $\mathrm{OPT}'_{i-1}$ but 
external in the clustering $\mathrm{OPT}'_i$. The number of edges of this type is $|A_j \setminus B| \cdot |A_j \cap B|$. 
Since $A_j$ is half-contained in $B$, this is at least $|A_j \setminus B|^2$. 

Second, if $u \in A_j \cap B$ and $v \in A_k \cap B$ with $j \ne k$: such edges are external in the 
clustering $\mathrm{OPT}'_{i-1}$ but internal in the clustering $\mathrm{OPT}'_i$. The number of edges of this type 
is $\sum_{j<k} |A_j \cap B| \cdot |A_k \cap B| \ge \sum_{j<k} |A_j \setminus B| \cdot |A_k \setminus B|$. 

Summing, it is easy to infer that $(\gamma_B^{i,i-1})^2 \ge \tfrac{1}{2} \left( \sum_j |A_j \setminus B| \right)^2 = \tfrac{1}{2} |\cup_j A_j \setminus B'|^2$. 

Let $(A'_j)_j$ denote the clusters of $\mathrm{OPT}_{i-1}$ that are not half-contained in $B$, but have 
non-empty intersections with $B$. We now show that 

\[ \tfrac{1}{2} |B' \setminus \cup_j A_j|^2 \le (\gamma_B^{i,i-1})^2. \]

We have $B' \setminus \cup_j A_j = \cup_j (A'_j \cap B)$. Observe that any $A'_j$ is a large cluster of $\mathrm{OPT}_{i-1}$, thus a 
cluster of $\mathrm{OPT}'_{i-1}$. Then $(\gamma_B^{i,i-1})^2$ includes all edges $uv$ such that one of the following two 
cases occurs. 



ONLINE CORRELATION CLUSTERING 






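First, if $u \in A'_j \setminus B$ and $v \in A'_j \cap B$: such edges are internal in the clustering $\mathrm{OPT}'_{i-1}$ but 
external in the clustering $\mathrm{OPT}'_i$. The number of edges of this type is $|A'_j \setminus B| \cdot |A'_j \cap B|$. 
Since $A'_j$ is not half-contained in $B$, this is at least $|A'_j \cap B|^2$. 

Second, if $u \in A'_j \cap B$ and $v \in A'_k \cap B$ with $j \ne k$: such edges are external in the 
clustering $\mathrm{OPT}'_{i-1}$ but internal in the clustering $\mathrm{OPT}'_i$. The number of edges of this type 
is $\sum_{j<k} |A'_j \cap B| \cdot |A'_k \cap B|$. 

Summing, we get 

\[ (\gamma_B^{i,i-1})^2 \ge \tfrac{1}{2} \left( \sum_j |A'_j \cap B| \right)^2 = \tfrac{1}{2} |B' \setminus \cup_j A_j|^2. \]

Together the two bounds give $|\delta^i(B)| \le 2\sqrt{2}\, \gamma_B^{i,i-1}$. ■ 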



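Lemma 2.17. For any $i \ge 1$, $\mathrm{OPT}_i$ has at most $1/\varepsilon^2$ non-singleton clusters, all of which 
are clusters of $\mathrm{OPT}'_i$. 

Proof. By definition, $\mathrm{OPT}_1$ has at most $1/\varepsilon^2$ non-singleton clusters. For any $i > 1$, a cluster 
of $\mathrm{OPT}_{i-1}$ can only be half-contained in one cluster of $\mathrm{OPT}'_i$. Therefore given $\mathrm{OPT}_{i-1}$, at 
most $1/\varepsilon^2$ clusters of $\mathrm{OPT}'_i$ are marked. Thus $\mathrm{OPT}_i$ has at most $1/\varepsilon^2$ non-singleton clusters. ■ 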

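We can now prove Lemma 2.8. 

Proof of Lemma 2.8. By Lemma 2.14, clusterings $\mathcal{F}_i$ and $\mathcal{F}_{i-1}$ only differ in their partition 
of $[1, t_{i-1}]$. Then the set of the vertices that are classified differently in $\mathcal{F}_i$ and $\mathcal{F}_{i-1}$ is 
$\bigcup_{C_i \in \mathcal{F}_i} C_i \setminus C_{i-1}$, where $C_{i-1}$ denotes the extension of $C_i$ to $\mathcal{F}_{i-1}$. Each of these vertices 
creates at most $t_K$ disagreements: 

\[ \mathrm{cost}(\mathcal{F}_{i-1}) - \mathrm{cost}(\mathcal{F}_i) \le \sum_{C_i \in \mathcal{F}_i} |C_i \setminus C_{i-1}|\, t_K. \qquad (2.3) \]

By Lemmas 2.15 and 2.16, 

\[ \sum_{C_i \in \mathcal{F}_i} |C_i \setminus C_{i-1}|\, t_K \le \left( 2\sqrt{2} \sum_{\text{large } B \in \mathrm{OPT}_i} \gamma_B^{i,i-1} + |S^*| + |T^*| \right) t_K. \qquad (2.4) \]

By Lemmas 2.11 and 2.13, 

\[ |S^*| \le \frac{\varepsilon\, t_{i-1}}{1 - \varepsilon} \quad \text{and} \quad |T^*| \le \sqrt{6} \sum_{\text{large } B \in \mathrm{OPT}_{i-1}} \gamma_B^{i-1,i}. \qquad (2.5) \]

The term $\sum_{\text{large } B \in \mathrm{OPT}_{i-1}} \gamma_B^{i-1,i}$ is the $\ell_1$ norm of the vector $(\gamma_B^{i-1,i})_{\text{large } B}$. Since 
$\mathrm{OPT}_{i-1}$ has at most $1/\varepsilon^2$ large clusters by Lemma 2.17, we can use Hölder's inequality: 

\[ \sum_{\text{large } B \in \mathrm{OPT}_{i-1}} \gamma_B^{i-1,i} = \| (\gamma_B^{i-1,i})_{\text{large } B} \|_1 \le \frac{1}{\varepsilon} \| (\gamma_B^{i-1,i})_{\text{large } B} \|_2. \]

By definition we have $\| (\gamma_B^{i-1,i})_{\text{large } B} \|_2 \le \sqrt{2(\mathrm{cost}(\mathrm{OPT}'_{i-1}) + \mathrm{cost}(\mathrm{OPT}'_i))}$. Thus 

\[ \sum_{\text{large } B \in \mathrm{OPT}_{i-1}} \gamma_B^{i-1,i} \le \frac{1}{\varepsilon} \sqrt{2(\alpha t_{i-1}^2/2 + \alpha t_i^2/2)} \le \frac{\sqrt{2\alpha}}{\varepsilon}\, t_i. \qquad (2.6) \]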
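
Similarly, we have 

\[ \sum_{\text{large } B \in \mathrm{OPT}_i} \gamma_B^{i,i-1} \le \frac{\sqrt{2\alpha}}{\varepsilon}\, t_i. \qquad (2.7) \]

Combining equations (2.3) through (2.7) and $\alpha = \varepsilon^4$ yields 

\[ \mathrm{cost}(\mathcal{F}_{i-1}) - \mathrm{cost}(\mathcal{F}_i) \le \left( (4 + 2\sqrt{3})\varepsilon + \frac{\varepsilon}{1 - \varepsilon} \right) t_i t_K. \]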
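The constant in the final bound comes from a short calculation, spelled out here as a sketch. It assumes the readings $|\delta^i(B)| \le 2\sqrt{2}\,\gamma_B$, $|T^*| \le \sqrt{6} \sum \gamma$, $|S^*| \le \varepsilon t_{i-1}/(1-\varepsilon)$, and $\sum \gamma \le (\sqrt{2\alpha}/\varepsilon)\, t_i$ for both sums, with $\alpha = \varepsilon^4$.

```latex
% With \alpha = \varepsilon^4, the bound on each sum of \gamma terms becomes
\frac{\sqrt{2\alpha}}{\varepsilon}\, t_i = \sqrt{2}\,\varepsilon\, t_i .

% Contribution of the \delta^i(B) terms:
2\sqrt{2}\cdot\sqrt{2}\,\varepsilon\, t_i = 4\,\varepsilon\, t_i .

% Contribution of T^*:
\sqrt{6}\cdot\sqrt{2}\,\varepsilon\, t_i = 2\sqrt{3}\,\varepsilon\, t_i .

% Contribution of S^*, using t_{i-1} \le t_i:
\frac{\varepsilon}{1-\varepsilon}\, t_{i-1} \le \frac{\varepsilon}{1-\varepsilon}\, t_i .

% Total number of vertices classified differently, times t_K disagreements each:
\Bigl( (4 + 2\sqrt{3})\,\varepsilon + \frac{\varepsilon}{1-\varepsilon} \Bigr)\, t_i\, t_K .
```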



3. Minimizing Disagreements Online 

Theorem 3.1. Algorithm Greedy is $(2n + 1)$-competitive for MinDisAgree. 

To prove Theorem 3.1, we need to compare the cost of the optimal clustering to the 
cost of the clustering constructed by the algorithm. The following lemma reduces this to, 
roughly, analyzing the number of vertices classified differently. 

Lemma 3.2. Let $\mathcal{W}$ and $\mathcal{W}'$ be two clusterings such that there is an injection $W'_i \in \mathcal{W}' \mapsto 
W_i \in \mathcal{W}$. Then cost($\mathcal{W}'$) $-$ cost($\mathcal{W}$) $\le n \sum_i |W'_i \setminus W_i|$. 

For subsets of vertices $S_1, \ldots, S_m$, we will write, with a slight abuse of notation, 
$\Gamma^+(S_1, \ldots, S_m)$ for the set of edges in $\Gamma^+(S_i, S_j)$ for any $i \ne j$: 

\[ \Gamma^+(S_1, \ldots, S_m) = \bigcup_{i \ne j} \Gamma^+(S_i, S_j). \]

Lemma 3.3. Let $C$ be a cluster created by Greedy, and $\mathcal{W} = \{W_1, \ldots, W_k\}$ denote 
the clusters of OPT. Then $|C| \le \max_i |C \cap W_i| + 2|\Gamma^+(C \cap W_1, \ldots, C \cap W_k)|$. We call 
$i_0 = \arg\max_i |C \cap W_i|$ the leader of $C$. 

Proof of Theorem 3.1. Let $\mathcal{C}$ denote the clustering given by Greedy. For every cluster 
$W_i$ of OPT, merge all the clusters of $\mathcal{C}$ that have $i$ as their leader. Let $\mathcal{C}' = (W'_i)$ be 
this new clustering. By definition of the greedy algorithm, this operation can only 
increase the cost since every pair of clusters has a negative-majority cut at the end of the 
algorithm: cost($\mathcal{C}$) $\le$ cost($\mathcal{C}'$). We apply Lemma 3.2 to $\mathcal{W}$ = OPT and $\mathcal{W}' = \mathcal{C}'$, and 
obtain: cost($\mathcal{C}'$) $\le$ cost(OPT) $+\, n \sum_i |W'_i \setminus W_i|$. By definition of $\mathcal{C}'$ we have 
$|W'_i \setminus W_i| = \sum_{C : \mathrm{leader}(C) = i} \sum_{j \ne i} |C \cap W_j|$, hence 

\[ \sum_i |W'_i \setminus W_i| = \sum_{C \in \mathcal{C}} \sum_{j \ne \mathrm{leader}(C)} |C \cap W_j|. \]

By Lemma 3.3, $\sum_{j \ne \mathrm{leader}(C)} |C \cap W_j| \le 2|\Gamma^+(C \cap W_1, \ldots, C \cap W_k)|$. Finally, to bound OPT 
from below, we observe that, for any two clusterings $\mathcal{C}$ and $\mathcal{W}$, it holds that the sum over 
$C \in \mathcal{C}$ of $|\Gamma^+(C \cap W_1, \ldots, C \cap W_k)|$ is at most cost($\mathcal{W}$). Combining these inequalities 
yields the theorem. ■ 
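For reference, the inequalities of the proof chain together as follows; this only restates the steps above in one display, with $\mathcal{W}$ = OPT.

```latex
\mathrm{cost}(\mathcal{C})
  \le \mathrm{cost}(\mathcal{C}')
  \le \mathrm{cost}(\mathrm{OPT}) + n \sum_i |W'_i \setminus W_i|
  =   \mathrm{cost}(\mathrm{OPT}) + n \sum_{C \in \mathcal{C}} \sum_{j \ne \mathrm{leader}(C)} |C \cap W_j|

  \le \mathrm{cost}(\mathrm{OPT}) + 2n \sum_{C \in \mathcal{C}} |\Gamma^+(C \cap W_1, \ldots, C \cap W_k)|
  \le \mathrm{cost}(\mathrm{OPT}) + 2n \, \mathrm{cost}(\mathrm{OPT})
  =   (2n + 1)\, \mathrm{cost}(\mathrm{OPT}).
```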

Theorem 3.4. Let ALG be a randomized algorithm for MinDisAgree. Then there exists 
an instance on which ALG has cost at least $n - 1 -$ cost(OPT) where OPT is the offline 
optimum. If OPT is constant then cost(ALG) = $\Omega(n)$ cost(OPT). 

Proof. Consider two cliques $A$ and $B$, each of size $m$, where all the internal edges of $A$ and 
$B$ are positive. Choose a vertex $a$ in $A$, and a set of vertices $b_1, \ldots, b_k$ in $B$. Define the 
edge labels of $ab_i$ as positive, for all $1 \le i \le k$, and the rest of the edges between $A$ and $B$ 
as negative. Define an input sequence starting with $a, b_1, \ldots, b_k$, followed by the rest of the 
vertices in any order. 
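The instance described in the proof can be built explicitly. The sketch below uses a hypothetical function name and vertex numbering (A is 0..m-1 with a = 0, B is m..2m-1 with b_1..b_k the first k vertices of B), and it fixes one arbitrary order for the remaining vertices; none of these choices come from the paper.

```python
from itertools import combinations


def lower_bound_instance(m, k):
    """Return (positive_edges, arrival_order) for two cliques of size m.

    Vertex a of A is joined by positive edges to k chosen vertices of B;
    every other edge between A and B is negative (absent from the set).
    """
    if not 1 <= k <= m:
        raise ValueError("need 1 <= k <= m")
    A = list(range(m))
    B = list(range(m, 2 * m))
    a, chosen = A[0], B[:k]

    positive = set()
    for side in (A, B):  # all internal edges of A and of B are positive
        positive.update(frozenset(e) for e in combinations(side, 2))
    positive.update(frozenset((a, b)) for b in chosen)

    # The sequence starts with a, b_1, ..., b_k; the rest may come in any order.
    rest = [v for v in A + B if v != a and v not in chosen]
    return positive, [a] + chosen + rest


positive, order = lower_bound_instance(m=4, k=2)
print(len(positive), order)
```

At the moment the first k + 1 vertices have arrived, the algorithm has seen only a star of positive edges centered at a, which is the situation the proof exploits.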



References 

[1] Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: ranking and 
clustering. In STOC '05: Proceedings of the thirty-seventh annual ACM symposium on Theory of 
computing, pages 684-693, New York, NY, USA, 2005. ACM Press. 

[2] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Mach. Learn., 56(1-3):89-113, 
2004. 

[3] Amir Ben-Dor, Ron Shamir, and Zohar Yakhini. Clustering gene expression patterns. Journal of 
Computational Biology, 6(3-4):281-297, 1999. 

[4] Allan Borodin and Ran El-Yaniv. Online computation and competitive analysis. Cambridge University 
Press, New York, NY, USA, 1998. 

[5] Moses Charikar, Chandra Chekuri, Tomas Feder, and Rajeev Motwani. Incremental clustering and 
dynamic information retrieval. SIAM J. Comput., 33(6):1417-1440, 2004. 

[6] Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. 
In FOCS, volume 00, page 524, Los Alamitos, CA, USA, 2003. IEEE Computer Society. 

[7] Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. 
J. Comput. Syst. Sci., 71(3):360-383, 2005. 

[8] William W. Cohen and Jacob Richman. Learning to match and cluster large high-dimensional data sets 
for data integration. In KDD '02: Proceedings of the eighth ACM SIGKDD international conference on 
Knowledge discovery and data mining, pages 475-480, New York, NY, USA, 2002. ACM. 

[9] Erik D. Demaine, Dotan Emanuel, Amos Fiat, and Nicole Immorlica. Correlation clustering in general 
weighted graphs. Theor. Comput. Sci., 361(2):172-187, 2006. 

[10] Jenny Rose Finkel and Christopher D. Manning. Enforcing transitivity in coreference resolution. In 
Proceedings of ACL-08: HLT, Short Papers, pages 45-48, Columbus, Ohio, June 2008. Association for 
Computational Linguistics. 

[11] Ioannis Giotis and Venkatesan Guruswami. Correlation clustering with a fixed number of clusters. 
Theory of Computing, 2(1):249-266, 2006. 

[12] Thorsten Joachims and John Hopcroft. Error bounds for correlation clustering. In ICML '05: 
Proceedings of the 22nd international conference on Machine learning, pages 385-392, New York, NY, USA, 
2005. ACM. 

[13] Marek Karpinski and Warren Schudy. Linear time approximation schemes for the Gale-Berlekamp game 
and related minimization problems. In STOC '09: Proceedings of the 41st annual ACM symposium on 
Theory of computing, pages 313-322, 2009. 

[14] Claire Mathieu and Warren Schudy. Correlation clustering with noisy input. To appear in Procs. 
21st SODA, preprint: http://www.cs.brown.edu/~ws/papers/cluster.pdf, 2010. 

[15] Ron Shamir, Roded Sharan, and Dekel Tsur. Cluster graph modification problems. Discrete Appl. 
Math., 144(1-2):173-182, 2004. 






C. MATHIEU, O. SANKUR, AND W. SCHUDY 



This work is licensed under the Creative Commons Attribution-NoDerivs License. To view a 
copy of this license, visit http://creativecommons.org/licenses/by-nd/3.0/. 



